Classic Network Architectures and Transfer Learning

林嶔 (Lin, Chin)

Lesson 8

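History of Deep Learning Networks - Incubation Period (1)

– The first pivotal moment is the 1986 study on backpropagation by David Rumelhart, Geoffrey Hinton, and Ronald Williams; it was only after this work that neural networks became practical to implement.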

F8_1

– Through this study and his subsequent work, Geoffrey Hinton came to be known as the "father of neural networks" and the "father of deep learning".

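History of Deep Learning Networks - Incubation Period (2)

– In addition, the neural networks of this era were structurally simple and therefore could not represent arbitrary input/output mappings, so they differed little from conventional statistical methods. Starting around the 1990s, statisticians, aided by computers, developed many methods whose accuracy far exceeded that of the multilayer perceptron (MLP). The major breakthroughs of this era were: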

  1. Support Vector Machine (SVM) - arguably the most innovative work in mathematical statistics of its time, with excellent classification performance

F8_2

  2. Random Forest (RF)

  3. Gradient Boosting Machine (GBM)

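History of Deep Learning Networks - Incubation Period (3)

– As covered in an earlier lesson, the convolutional neural network (CNN) that Yann LeCun developed in 1989 achieved this, and the achievement earned him the title "father of convolutional neural networks".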

F8_3

F6_9

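History of Deep Learning Networks - Incubation Period (4)

– In this era, only Geoffrey Hinton persisted. He looked for alternative optimization methods to overcome the vanishing gradient problem encountered by backpropagation, and in 2006 he succeeded in training a "deep neural network" (in fact no more than 10 layers deep) by training it layer by layer and then combining the layers.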

F8_4

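History of Deep Learning Networks - Incubation Period (5)

– In hindsight, deep learning is an approach whose strength only shows with large amounts of data; had sufficient data never become available, deep learning would never have had its day.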

F8_5

F8_6

F8_7

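History of Deep Learning Networks - Foundation Period (1)

– The figure below shows the winning algorithm for each year. Before 2012 the competition was mostly won by methods such as SVM and random forests, but from 2012 onward convolutional neural networks won every ILSVRC.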

F8_8

F8_9

History of Deep Learning Networks - Foundation Period (2)

F8_10

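  1. Using ReLU as the nonlinear activation function - this alleviated the vanishing gradient problem to a small degree, allowing the network to reach a total depth of 8 layers

  2. Using Dropout - arguably the most innovative point of the whole study; it mitigates the harm of overfitting to some extent

  3. Using overlapping max pooling - a new idea, though not very difficult to implement

  4. Data augmentation - the augmentation used in this study would be considered very complete even today, including cropping, rotation, flipping, scaling, ZCA whitening, and other operations

  5. Using GPUs to accelerate the training of deep convolutional networks - at the time this had a high barrier to entry and was a direction few had considered; it sped up neural network training effectively

  6. Proposing the local response normalization (LRN) layer - this mimics the biological phenomenon in which a neuron with a strong signal suppresses the weaker signals of its neighbors; later studies, however, showed it to be of little use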
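
– To illustrate item 1, the following is a minimal sketch (not taken from the original study) of why ReLU eases the vanishing gradient problem. It assumes a deliberately simplified chain of 8 layers in which only the activation derivatives are multiplied together and the weights are ignored; the names sigmoid_grad, relu_grad, and n_layers are illustrative.

```r
# Derivative of the sigmoid activation: never larger than 0.25
sigmoid <- function(x) 1 / (1 + exp(-x))
sigmoid_grad <- function(x) sigmoid(x) * (1 - sigmoid(x))

# Derivative of ReLU: exactly 1 for positive inputs, 0 otherwise
relu_grad <- function(x) as.numeric(x > 0)

n_layers <- 8   # assumed depth, matching the 8 layers mentioned above
x <- 0.5        # an arbitrary positive pre-activation value

# Simplified chain rule: multiply the activation derivative once per layer
sigmoid_chain <- sigmoid_grad(x)^n_layers
relu_chain <- relu_grad(x)^n_layers

print(c(sigmoid = sigmoid_chain, relu = relu_chain))
```

With the sigmoid the product shrinks geometrically with depth, while with ReLU it stays at 1 as long as the unit is active. Real networks also multiply by the weights, so this is only an intuition, not a full analysis.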

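– It is because Geoffrey Hinton appeared at three consecutive key moments in the history of neural networks that he is honored as the "father of neural networks" and the "father of deep learning".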

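History of Deep Learning Networks - Foundation Period (3)

– However, earlier research had already established that nonlinear structures help model more complex functions, so we need to add more nonlinear structure inside the network to improve prediction.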

F8_11

F8_12

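– The most important contribution of this study was proposing the 1×1 convolution kernel and demonstrating its benefits experimentally. Later networks adopted the idea extensively in their model architectures; today, in most state-of-the-art networks, 1×1 kernels are used even more often than kernels of any other size.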
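
– As a rough illustration of what a 1×1 kernel computes, the base-R sketch below treats it as a linear mix of the channels at each pixel, followed by a nonlinearity. The array sizes, the random weights, and the helper conv1x1 are assumptions made for this example; it does not use mxnet.

```r
set.seed(1)

# A toy feature map in (width, height, channel) order: 4 x 4 pixels, 3 channels
feature_map <- array(rnorm(4 * 4 * 3), dim = c(4, 4, 3))

# 1x1 kernel weights: one column per output channel (3 inputs -> 2 outputs)
weights <- matrix(rnorm(3 * 2), nrow = 3, ncol = 2)

conv1x1 <- function(x, w) {
  d <- dim(x)
  # Flatten pixels into rows; each column holds one input channel
  flat <- matrix(x, nrow = d[1] * d[2], ncol = d[3])
  # Every pixel gets the same channel-mixing matrix applied
  out <- flat %*% w
  array(out, dim = c(d[1], d[2], ncol(w)))
}

out <- conv1x1(feature_map, weights)
relu_out <- pmax(out, 0)  # the nonlinearity applied after the 1x1 convolution

print(dim(relu_out))
```

Because the spatial size is untouched, a 1×1 convolution only changes the number of channels, which is why it is a cheap way to add an extra nonlinear layer.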

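History of Deep Learning Networks - Foundation Period (4)

– Looking back at the structure of AlexNet, do you find it hard to understand where an 11×11 kernel should be used, and where a 3×3 or 5×5 one? This makes the choices involved in defining a model architecture too open-ended, so they ran an important experiment to settle the question.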

F8_13

F8_14

F8_15

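– By comparing the six networks above, this study gave us several important guidelines for designing future model architectures:

  1. Deeper networks generally perform better

  2. 1×1 kernels also significantly improve performance (consistent with the conclusion of the previous study)

  3. The local response normalization layer does little to improve network performance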

History of Deep Learning Networks - Foundation Period (5)

F8_16

F8_17

– Owing to the important contribution of Network In Network described earlier, the final Inception module adds a 1×1 convolution layer to every branch to achieve nonlinear fitting:

F8_18

History of Deep Learning Networks - Foundation Period (6)

– The study by Diederik P. Kingma and Jimmy Lei Ba, Adam: A Method for Stochastic Optimization, proposed the well-known Adam optimizer.
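
– For readers who want to see the update rule in code, here is a minimal base-R sketch of Adam applied to a toy one-dimensional problem. The function adam_minimize, the quadratic objective, and the chosen learning rate are assumptions for illustration; the default values of beta1, beta2, and eps follow those suggested in the paper.

```r
adam_minimize <- function(grad_fn, theta, lr = 0.001, beta1 = 0.9,
                          beta2 = 0.999, eps = 1e-8, steps = 1000) {
  m <- 0  # running average of the gradient (first moment)
  v <- 0  # running average of the squared gradient (second moment)
  for (t in seq_len(steps)) {
    g <- grad_fn(theta)
    m <- beta1 * m + (1 - beta1) * g
    v <- beta2 * v + (1 - beta2) * g^2
    # Bias correction compensates for m and v starting at zero
    m_hat <- m / (1 - beta1^t)
    v_hat <- v / (1 - beta2^t)
    theta <- theta - lr * m_hat / (sqrt(v_hat) + eps)
  }
  theta
}

# Toy objective f(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3)
theta <- adam_minimize(function(th) 2 * (th - 3), theta = 0, lr = 0.1, steps = 500)
print(theta)
```

The step size is scaled by the ratio of the two moment estimates, so early steps have a magnitude close to lr regardless of how large the raw gradient is.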

F8_19

History of Deep Learning Networks - Foundation Period (7)

– Accordingly, the network that won in 2014 is called Inception v1, and in 2015 the Google team went on to develop Inception v2 and Inception v3.

F8_20

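– As mentioned earlier, by the end of 2014 Inception v1 had already reached an accuracy of 93.3%. This study reached 95.2% accuracy simply by adding batch normalization on top of GoogLeNet, making it the first study to surpass human accuracy (~95.0%).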

F8_21

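– After developing Inception v3, the Google team entered it in the 2015 ILSVRC. However, its advance was relatively small, and the same competition saw what is so far the biggest, bombshell-level breakthrough in the history of deep learning, so Inception v3 was overshadowed...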

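History of Deep Learning Networks - Explosion Period (1)

– More important than winning the competition and officially surpassing human performance, this was the first time the vanishing gradient problem was solved in a real sense. The Residual Learning they developed successfully trained a network 1000 layers deep, at a time when almost no team was able to train a neural network deeper than 50 layers.

– This bombshell study, Deep Residual Learning for Image Recognition, was published to great anticipation at CVPR 2016 and, unsurprisingly, won the conference's best paper award: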
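
– The core idea of Residual Learning, that each block outputs its input plus a learned correction, can be sketched in a few lines of base R. This is a simplified illustration under assumed names (residual_block, w1, w2) using plain matrix multiplication; it is not the convolutional unit used in the actual paper.

```r
relu <- function(x) pmax(x, 0)

# A residual block returns x + F(x); here F is two small linear maps with a ReLU between
residual_block <- function(x, w1, w2) {
  fx <- w2 %*% relu(w1 %*% x)
  x + fx
}

x <- c(1, -2, 3)
w_zero <- matrix(0, nrow = 3, ncol = 3)

# With all-zero weights F(x) is 0, so the block reduces to the identity mapping
y <- residual_block(x, w_zero, w_zero)

# Stacking 100 such blocks still passes the signal through unchanged
h <- x
for (i in 1:100) h <- residual_block(h, w_zero, w_zero)

print(as.vector(h))
```

Because the shortcut carries the input forward untouched, the signal (and, during training, the gradient) has a direct path through every block, which is the property that makes very deep networks trainable.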

F8_22

F8_23

F8_24

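History of Deep Learning Networks - Explosion Period (2)

– While the vanishing gradient problem persisted, however, we simply could not build neural networks that were complex (deep) enough, which put certain very hard prediction tasks out of reach, such as object detection.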

F8_25

F8_26

F8_27

F8_36

– This is what it can achieve:

F8_35

History of Deep Learning Networks - Explosion Period (3)

F8_32

F8_33

F8_34

History of Deep Learning Networks - Explosion Period (4)

F8_28

F8_29

F8_30

F8_31

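History of Deep Learning Networks - Explosion Period (5)

– This is the last architecture since ResNet in 2015 to substantially improve the predictive power of neural networks. It won the final ILSVRC in 2017 and once again set the best result to date. Its details were published in the 2017 paper Squeeze-and-Excitation Networks by the autonomous driving company Momenta.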

F8_37

History of Deep Learning Networks - Explosion Period (6)

– Studies on model accuracy

  1. Densely Connected Convolutional Networks (published in 2016, cited more than 750 times)

  2. Xception: Deep Learning with Depthwise Separable Convolutions (published in 2016, cited more than 200 times)

  3. Dual Path Networks (published in 2017, cited more than 300 times)

– Studies on lightweight models

  1. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size (published in 2016, cited more than 500 times)

  2. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications (published in 2017, cited more than 350 times)

  3. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices (published in 2017, cited more than 100 times)

  4. MobileNetV2: Inverted Residuals and Linear Bottlenecks (published in 2018)

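Making Predictions with a Classic Pretrained Model (1)

– We can download the resnet-18 model to make predictions. The table mapping class indices to labels can be downloaded here (a Chinese version is also available).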

library(mxnet)
library(imager)
library(magrittr)

#Load a pre-trained residual network model

res_model = mx.model.load("model/resnet-18", 0)
res_sym = mx.symbol.load("model/resnet-18-symbol.json")
label_names = readLines("model/synset.txt")

#Define image processing functions

preproc.image <- function(im) {
  resized <- resize(im, 224, 224)
  resized <- as.array(resized) * 255
  # Reshape to format needed by mxnet (width, height, channel, num)
  dim(resized) <- c(224, 224, 3, 1)
  return(resized)
}

#Read image

img <- load.image(system.file("extdata/parrots.png", package="imager"))

#Pre-processing

normed <- preproc.image(img)

#Display image

par(mar=rep(0,4))
plot(NA, xlim = c(0.04, 0.96), ylim = c(0.04, 0.96), xaxt = "n", yaxt = "n", bty = "n")
rasterImage(img, 0, 0, 1, 1, interpolate=FALSE)

#Predict

prob <- predict(res_model, X = normed, ctx = mx.cpu())
cat(paste0(label_names[which.max(prob)], ': ', formatC(max(prob), 4, format = 'f'), '\n'))
## n01818515 macaw: 0.9956
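
– The call above reports only the single most probable class. If you also want the runner-up classes, one possible approach is sketched below in base R. The short prob vector and the class names are made-up stand-ins for the 1000-element output and label_names used above, and top_k is a hypothetical helper.

```r
# Hypothetical stand-ins for `prob` and `label_names` from the prediction step
prob <- c(0.02, 0.90, 0.05, 0.01, 0.02)
label_names <- c("class_a", "class_b", "class_c", "class_d", "class_e")

# Return the k most probable labels together with their probabilities
top_k <- function(prob, labels, k = 3) {
  idx <- order(prob, decreasing = TRUE)[seq_len(k)]
  data.frame(label = labels[idx], prob = prob[idx], stringsAsFactors = FALSE)
}

top3 <- top_k(prob, label_names, k = 3)
print(top3)
```

Since order() treats a matrix as a plain vector, the same helper should also work on the 1000 × 1 matrix returned by predict for a single image.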

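Making Predictions with a Classic Pretrained Model (2)

– If you are unsure what the downloaded network looks like, you can inspect it with the function "graph.viz":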

res_sym = mx.symbol.load("model/resnet-18-symbol.json")
graph.viz(res_sym)

F8_38

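Making Predictions with a Classic Pretrained Model (3)

– If you are wondering how the detailed parameters can be found, open the JSON file in a text editor such as Notepad.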

# Model Architecture

# 224×224

data <- mx.symbol.Variable(name = 'data')
bn_data <- mx.symbol.BatchNorm(data = data, eps = "2e-05", name = 'bn_data')

# 112×112

conv0 <- mx.symbol.Convolution(data = bn_data, no_bias = TRUE, name = 'conv0',
                               kernel = c(7, 7), pad = c(3, 3), stride = c(2, 2), num_filter = 64)
bn0 <- mx.symbol.BatchNorm(data = conv0, fix_gamma = FALSE, eps = "2e-05", name = 'bn0')
relu0 <- mx.symbol.Activation(data = bn0, act_type = "relu", name = 'relu0')

# 56×56

# stage1_unit1

pooling0 <- mx.symbol.Pooling(data = relu0, pool_type = "max", name = 'pooling0',
                              kernel = c(3, 3), pad = c(1, 1), stride = c(2, 2))
stage1_unit1_bn1 <- mx.symbol.BatchNorm(data = pooling0, fix_gamma = FALSE, eps = "2e-05", name = 'stage1_unit1_bn1')
stage1_unit1_relu1 <- mx.symbol.Activation(data = stage1_unit1_bn1, act_type = "relu", name = 'stage1_unit1_relu1')
stage1_unit1_conv1 <- mx.symbol.Convolution(data = stage1_unit1_relu1, no_bias = TRUE, name = 'stage1_unit1_conv1',
                                            kernel = c(3, 3), pad = c(1, 1), stride = c(1, 1), num_filter = 64)
stage1_unit1_bn2 <- mx.symbol.BatchNorm(data = stage1_unit1_conv1, fix_gamma = FALSE, eps = "2e-05", name = 'stage1_unit1_bn2')
stage1_unit1_relu2 <- mx.symbol.Activation(data = stage1_unit1_bn2, act_type = "relu", name = 'stage1_unit1_relu2')
stage1_unit1_conv2 <- mx.symbol.Convolution(data = stage1_unit1_relu2, no_bias = TRUE, name = 'stage1_unit1_conv2',
                                            kernel = c(3, 3), pad = c(1, 1), stride = c(1, 1), num_filter = 64)

stage1_unit1_sc <- mx.symbol.Convolution(data = stage1_unit1_relu1, no_bias = TRUE, name = 'stage1_unit1_sc',
                                         kernel = c(1, 1), pad = c(0, 0), stride = c(1, 1), num_filter = 64)

elemwise_add_plus0 <- mx.symbol.broadcast_plus(lhs = stage1_unit1_conv2, rhs = stage1_unit1_sc, name = 'elemwise_add_plus0')

# stage1_unit2

stage1_unit2_bn1 <- mx.symbol.BatchNorm(data = elemwise_add_plus0, fix_gamma = FALSE, eps = "2e-05", name = 'stage1_unit2_bn1')
stage1_unit2_relu1 <- mx.symbol.Activation(data = stage1_unit2_bn1, act_type = "relu", name = 'stage1_unit2_relu1')
stage1_unit2_conv1 <- mx.symbol.Convolution(data = stage1_unit2_relu1, no_bias = TRUE, name = 'stage1_unit2_conv1',
                                            kernel = c(3, 3), pad = c(1, 1), stride = c(1, 1), num_filter = 64)
stage1_unit2_bn2 <- mx.symbol.BatchNorm(data = stage1_unit2_conv1, fix_gamma = FALSE, eps = "2e-05", name = 'stage1_unit2_bn2')
stage1_unit2_relu2 <- mx.symbol.Activation(data = stage1_unit2_bn2, act_type = "relu", name = 'stage1_unit2_relu2')
stage1_unit2_conv2 <- mx.symbol.Convolution(data = stage1_unit2_relu2, no_bias = TRUE, name = 'stage1_unit2_conv2',
                                            kernel = c(3, 3), pad = c(1, 1), stride = c(1, 1), num_filter = 64)

elemwise_add_plus1 <- mx.symbol.broadcast_plus(lhs = stage1_unit2_conv2, rhs = elemwise_add_plus0, name = 'elemwise_add_plus1')

# 28×28

# stage2_unit1

stage2_unit1_bn1 <- mx.symbol.BatchNorm(data = elemwise_add_plus1, fix_gamma = FALSE, eps = "2e-05", name = 'stage2_unit1_bn1')
stage2_unit1_relu1 <- mx.symbol.Activation(data = stage2_unit1_bn1, act_type = "relu", name = 'stage2_unit1_relu1')
stage2_unit1_conv1 <- mx.symbol.Convolution(data = stage2_unit1_relu1, no_bias = TRUE, name = 'stage2_unit1_conv1',
                                            kernel = c(3, 3), pad = c(1, 1), stride = c(2, 2), num_filter = 128)
stage2_unit1_bn2 <- mx.symbol.BatchNorm(data = stage2_unit1_conv1, fix_gamma = FALSE, eps = "2e-05", name = 'stage2_unit1_bn2')
stage2_unit1_relu2 <- mx.symbol.Activation(data = stage2_unit1_bn2, act_type = "relu", name = 'stage2_unit1_relu2')
stage2_unit1_conv2 <- mx.symbol.Convolution(data = stage2_unit1_relu2, no_bias = TRUE, name = 'stage2_unit1_conv2',
                                            kernel = c(3, 3), pad = c(1, 1), stride = c(1, 1), num_filter = 128)

stage2_unit1_sc <- mx.symbol.Convolution(data = stage2_unit1_relu1, no_bias = TRUE, name = 'stage2_unit1_sc',
                                         kernel = c(1, 1), pad = c(0, 0), stride = c(2, 2), num_filter = 128)

elemwise_add_plus2 <- mx.symbol.broadcast_plus(lhs = stage2_unit1_conv2, rhs = stage2_unit1_sc, name = 'elemwise_add_plus2')

# stage2_unit2

stage2_unit2_bn1 <- mx.symbol.BatchNorm(data = elemwise_add_plus2, fix_gamma = FALSE, eps = "2e-05", name = 'stage2_unit2_bn1')
stage2_unit2_relu1 <- mx.symbol.Activation(data = stage2_unit2_bn1, act_type = "relu", name = 'stage2_unit2_relu1')
stage2_unit2_conv1 <- mx.symbol.Convolution(data = stage2_unit2_relu1, no_bias = TRUE, name = 'stage2_unit2_conv1',
                                            kernel = c(3, 3), pad = c(1, 1), stride = c(1, 1), num_filter = 128)
stage2_unit2_bn2 <- mx.symbol.BatchNorm(data = stage2_unit2_conv1, fix_gamma = FALSE, eps = "2e-05", name = 'stage2_unit2_bn2')
stage2_unit2_relu2 <- mx.symbol.Activation(data = stage2_unit2_bn2, act_type = "relu", name = 'stage2_unit2_relu2')
stage2_unit2_conv2 <- mx.symbol.Convolution(data = stage2_unit2_relu2, no_bias = TRUE, name = 'stage2_unit2_conv2',
                                            kernel = c(3, 3), pad = c(1, 1), stride = c(1, 1), num_filter = 128)

elemwise_add_plus3 <- mx.symbol.broadcast_plus(lhs = stage2_unit2_conv2, rhs = elemwise_add_plus2, name = 'elemwise_add_plus3')

# 14×14

# stage3_unit1

stage3_unit1_bn1 <- mx.symbol.BatchNorm(data = elemwise_add_plus3, fix_gamma = FALSE, eps = "2e-05", name = 'stage3_unit1_bn1')
stage3_unit1_relu1 <- mx.symbol.Activation(data = stage3_unit1_bn1, act_type = "relu", name = 'stage3_unit1_relu1')
stage3_unit1_conv1 <- mx.symbol.Convolution(data = stage3_unit1_relu1, no_bias = TRUE, name = 'stage3_unit1_conv1',
                                            kernel = c(3, 3), pad = c(1, 1), stride = c(2, 2), num_filter = 256)
stage3_unit1_bn2 <- mx.symbol.BatchNorm(data = stage3_unit1_conv1, fix_gamma = FALSE, eps = "2e-05", name = 'stage3_unit1_bn2')
stage3_unit1_relu2 <- mx.symbol.Activation(data = stage3_unit1_bn2, act_type = "relu", name = 'stage3_unit1_relu2')
stage3_unit1_conv2 <- mx.symbol.Convolution(data = stage3_unit1_relu2, no_bias = TRUE, name = 'stage3_unit1_conv2',
                                            kernel = c(3, 3), pad = c(1, 1), stride = c(1, 1), num_filter = 256)

stage3_unit1_sc <- mx.symbol.Convolution(data = stage3_unit1_relu1, no_bias = TRUE, name = 'stage3_unit1_sc',
                                         kernel = c(1, 1), pad = c(0, 0), stride = c(2, 2), num_filter = 256)

elemwise_add_plus4 <- mx.symbol.broadcast_plus(lhs = stage3_unit1_conv2, rhs = stage3_unit1_sc, name = 'elemwise_add_plus4')

# stage3_unit2

stage3_unit2_bn1 <- mx.symbol.BatchNorm(data = elemwise_add_plus4, fix_gamma = FALSE, eps = "2e-05", name = 'stage3_unit2_bn1')
stage3_unit2_relu1 <- mx.symbol.Activation(data = stage3_unit2_bn1, act_type = "relu", name = 'stage3_unit2_relu1')
stage3_unit2_conv1 <- mx.symbol.Convolution(data = stage3_unit2_relu1, no_bias = TRUE, name = 'stage3_unit2_conv1',
                                            kernel = c(3, 3), pad = c(1, 1), stride = c(1, 1), num_filter = 256)
stage3_unit2_bn2 <- mx.symbol.BatchNorm(data = stage3_unit2_conv1, fix_gamma = FALSE, eps = "2e-05", name = 'stage3_unit2_bn2')
stage3_unit2_relu2 <- mx.symbol.Activation(data = stage3_unit2_bn2, act_type = "relu", name = 'stage3_unit2_relu2')
stage3_unit2_conv2 <- mx.symbol.Convolution(data = stage3_unit2_relu2, no_bias = TRUE, name = 'stage3_unit2_conv2',
                                            kernel = c(3, 3), pad = c(1, 1), stride = c(1, 1), num_filter = 256)

elemwise_add_plus5 <- mx.symbol.broadcast_plus(lhs = stage3_unit2_conv2, rhs = elemwise_add_plus4, name = 'elemwise_add_plus5')

# 7×7

# stage4_unit1

stage4_unit1_bn1 <- mx.symbol.BatchNorm(data = elemwise_add_plus5, fix_gamma = FALSE, eps = "2e-05", name = 'stage4_unit1_bn1')
stage4_unit1_relu1 <- mx.symbol.Activation(data = stage4_unit1_bn1, act_type = "relu", name = 'stage4_unit1_relu1')
stage4_unit1_conv1 <- mx.symbol.Convolution(data = stage4_unit1_relu1, no_bias = TRUE, name = 'stage4_unit1_conv1',
                                            kernel = c(3, 3), pad = c(1, 1), stride = c(2, 2), num_filter = 512)
stage4_unit1_bn2 <- mx.symbol.BatchNorm(data = stage4_unit1_conv1, fix_gamma = FALSE, eps = "2e-05", name = 'stage4_unit1_bn2')
stage4_unit1_relu2 <- mx.symbol.Activation(data = stage4_unit1_bn2, act_type = "relu", name = 'stage4_unit1_relu2')
stage4_unit1_conv2 <- mx.symbol.Convolution(data = stage4_unit1_relu2, no_bias = TRUE, name = 'stage4_unit1_conv2',
                                            kernel = c(3, 3), pad = c(1, 1), stride = c(1, 1), num_filter = 512)

stage4_unit1_sc <- mx.symbol.Convolution(data = stage4_unit1_relu1, no_bias = TRUE, name = 'stage4_unit1_sc',
                                         kernel = c(1, 1), pad = c(0, 0), stride = c(2, 2), num_filter = 512)

elemwise_add_plus6 <- mx.symbol.broadcast_plus(lhs = stage4_unit1_conv2, rhs = stage4_unit1_sc, name = 'elemwise_add_plus6')

# stage4_unit2

stage4_unit2_bn1 <- mx.symbol.BatchNorm(data = elemwise_add_plus6, fix_gamma = FALSE, eps = "2e-05", name = 'stage4_unit2_bn1')
stage4_unit2_relu1 <- mx.symbol.Activation(data = stage4_unit2_bn1, act_type = "relu", name = 'stage4_unit2_relu1')
stage4_unit2_conv1 <- mx.symbol.Convolution(data = stage4_unit2_relu1, no_bias = TRUE, name = 'stage4_unit2_conv1',
                                            kernel = c(3, 3), pad = c(1, 1), stride = c(1, 1), num_filter = 512)
stage4_unit2_bn2 <- mx.symbol.BatchNorm(data = stage4_unit2_conv1, fix_gamma = FALSE, eps = "2e-05", name = 'stage4_unit2_bn2')
stage4_unit2_relu2 <- mx.symbol.Activation(data = stage4_unit2_bn2, act_type = "relu", name = 'stage4_unit2_relu2')
stage4_unit2_conv2 <- mx.symbol.Convolution(data = stage4_unit2_relu2, no_bias = TRUE, name = 'stage4_unit2_conv2',
                                            kernel = c(3, 3), pad = c(1, 1), stride = c(1, 1), num_filter = 512)

elemwise_add_plus7 <- mx.symbol.broadcast_plus(lhs = stage4_unit2_conv2, rhs = elemwise_add_plus6, name = 'elemwise_add_plus7')

# Final

bn1 <- mx.symbol.BatchNorm(data = elemwise_add_plus7, fix_gamma = FALSE, eps = "2e-05", name = 'bn1')
relu1 <- mx.symbol.Activation(data = bn1, act_type = "relu", name = 'relu1')
pool1 <- mx.symbol.Pooling(data = relu1, pool_type = "avg", name = 'pool1',
                           kernel = c(7, 7), pad = c(0, 0), stride = c(7, 7))
flatten0 <- mx.symbol.Flatten(data = pool1, name = 'flatten0')
fc1 <- mx.symbol.FullyConnected(data = flatten0, num_hidden = 1000, name = 'fc1')
softmax <- mx.symbol.softmax(data = fc1, axis = 1, name = 'softmax')
res_model$symbol <- softmax

prob <- predict(res_model, X = normed, ctx = mx.cpu())
cat(paste0(label_names[which.max(prob)], ': ', formatC(max(prob), 4, format = 'f'), '\n'))
## n01818515 macaw: 0.9956

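Making Predictions with a Classic Pretrained Model (4)

– The code below extracts the output of flatten0. Note that because this truncated network no longer uses the two weight parameters fc1_weight and fc1_bias, we must remove them before predicting: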

my_model <- res_model

my_model$symbol <- flatten0
my_model$arg.params <- my_model$arg.params[names(my_model$arg.params) %in% names(mx.symbol.infer.shape(flatten0, data = c(224, 224, 3, 7))$arg.shapes)]
my_model$aux.params <- my_model$aux.params[names(my_model$aux.params) %in% names(mx.symbol.infer.shape(flatten0, data = c(224, 224, 3, 7))$aux.shapes)]

features <- predict(my_model, X = normed, ctx = mx.cpu())
dim(features)
## [1] 512   1

Making Predictions with a Classic Pretrained Model (5)

#Get features

all_layers = res_sym$get.internals()
tail(all_layers$outputs, 30)
##  [1] "stage4_unit2_bn1_gamma"       "stage4_unit2_bn1_beta"       
##  [3] "stage4_unit2_bn1_moving_mean" "stage4_unit2_bn1_moving_var" 
##  [5] "stage4_unit2_bn1_output"      "stage4_unit2_relu1_output"   
##  [7] "stage4_unit2_conv1_weight"    "stage4_unit2_conv1_output"   
##  [9] "stage4_unit2_bn2_gamma"       "stage4_unit2_bn2_beta"       
## [11] "stage4_unit2_bn2_moving_mean" "stage4_unit2_bn2_moving_var" 
## [13] "stage4_unit2_bn2_output"      "stage4_unit2_relu2_output"   
## [15] "stage4_unit2_conv2_weight"    "stage4_unit2_conv2_output"   
## [17] "_plus7_output"                "bn1_gamma"                   
## [19] "bn1_beta"                     "bn1_moving_mean"             
## [21] "bn1_moving_var"               "bn1_output"                  
## [23] "relu1_output"                 "pool1_output"                
## [25] "flatten0_output"              "fc1_weight"                  
## [27] "fc1_bias"                     "fc1_output"                  
## [29] "softmax_label"                "softmax_output"
#Get symbol

flatten0_output = which(all_layers$outputs == 'flatten0_output') %>% all_layers$get.output()

my_model <- res_model
my_model$symbol <- flatten0_output
my_model$arg.params <- my_model$arg.params[names(my_model$arg.params) %in% names(mx.symbol.infer.shape(flatten0_output, data = c(224, 224, 3, 7))$arg.shapes)]
my_model$aux.params <- my_model$aux.params[names(my_model$aux.params) %in% names(mx.symbol.infer.shape(flatten0_output, data = c(224, 224, 3, 7))$aux.shapes)]

features <- predict(my_model, X = normed, ctx = mx.cpu())
dim(features)
## [1] 512   1

Making Predictions with a Classic Pretrained Model (6)

flatten0_output = which(all_layers$outputs == 'flatten0_output') %>% all_layers$get.output()
softmax_output = which(all_layers$outputs == 'softmax_output') %>% all_layers$get.output()
out = mx.symbol.Group(c(flatten0_output, softmax_output))
executor = mx.simple.bind(symbol = out, data = c(224, 224, 3, 1), ctx = mx.cpu())

mx.exec.update.arg.arrays(executor, res_model$arg.params, match.name = TRUE)
mx.exec.update.aux.arrays(executor, res_model$aux.params, match.name = TRUE)
mx.exec.update.arg.arrays(executor, list(data = mx.nd.array(normed)), match.name = TRUE)
mx.exec.forward(executor, is.train = FALSE)

features = as.array(executor$ref.outputs$flatten0_output)
dim(features)
## [1] 512   1
prob = as.array(executor$ref.outputs$softmax_output)
cat(paste0(label_names[which.max(prob)], ': ', formatC(max(prob), 4, format = 'f'), '\n'))
## n01818515 macaw: 0.9956

Exercise 1: Reproduce the Prediction from features

– You may need to retrieve the weights from "res_model":

PARAMS <- res_model$arg.params
ls(PARAMS)
##  [1] "bn_data_beta"              "bn_data_gamma"            
##  [3] "bn0_beta"                  "bn0_gamma"                
##  [5] "bn1_beta"                  "bn1_gamma"                
##  [7] "conv0_weight"              "fc1_bias"                 
##  [9] "fc1_weight"                "stage1_unit1_bn1_beta"    
## [11] "stage1_unit1_bn1_gamma"    "stage1_unit1_bn2_beta"    
## [13] "stage1_unit1_bn2_gamma"    "stage1_unit1_conv1_weight"
## [15] "stage1_unit1_conv2_weight" "stage1_unit1_sc_weight"   
## [17] "stage1_unit2_bn1_beta"     "stage1_unit2_bn1_gamma"   
## [19] "stage1_unit2_bn2_beta"     "stage1_unit2_bn2_gamma"   
## [21] "stage1_unit2_conv1_weight" "stage1_unit2_conv2_weight"
## [23] "stage2_unit1_bn1_beta"     "stage2_unit1_bn1_gamma"   
## [25] "stage2_unit1_bn2_beta"     "stage2_unit1_bn2_gamma"   
## [27] "stage2_unit1_conv1_weight" "stage2_unit1_conv2_weight"
## [29] "stage2_unit1_sc_weight"    "stage2_unit2_bn1_beta"    
## [31] "stage2_unit2_bn1_gamma"    "stage2_unit2_bn2_beta"    
## [33] "stage2_unit2_bn2_gamma"    "stage2_unit2_conv1_weight"
## [35] "stage2_unit2_conv2_weight" "stage3_unit1_bn1_beta"    
## [37] "stage3_unit1_bn1_gamma"    "stage3_unit1_bn2_beta"    
## [39] "stage3_unit1_bn2_gamma"    "stage3_unit1_conv1_weight"
## [41] "stage3_unit1_conv2_weight" "stage3_unit1_sc_weight"   
## [43] "stage3_unit2_bn1_beta"     "stage3_unit2_bn1_gamma"   
## [45] "stage3_unit2_bn2_beta"     "stage3_unit2_bn2_gamma"   
## [47] "stage3_unit2_conv1_weight" "stage3_unit2_conv2_weight"
## [49] "stage4_unit1_bn1_beta"     "stage4_unit1_bn1_gamma"   
## [51] "stage4_unit1_bn2_beta"     "stage4_unit1_bn2_gamma"   
## [53] "stage4_unit1_conv1_weight" "stage4_unit1_conv2_weight"
## [55] "stage4_unit1_sc_weight"    "stage4_unit2_bn1_beta"    
## [57] "stage4_unit2_bn1_gamma"    "stage4_unit2_bn2_beta"    
## [59] "stage4_unit2_bn2_gamma"    "stage4_unit2_conv1_weight"
## [61] "stage4_unit2_conv2_weight"
cat(paste0(label_names[which.max(prob)], ': ', formatC(max(prob), 4, format = 'f'), '\n'))
## n01818515 macaw: 0.9956

Exercise 1 Answer

# FullyConnected
FC_COEF = PARAMS$fc1_weight %>% as.array
FC_BIAS = PARAMS$fc1_bias %>% as.array
FC1_out = t(features)%*%FC_COEF + FC_BIAS

# Softmax
new.prob <- exp(FC1_out)/sum(exp(FC1_out))
cat(paste0(label_names[which.max(new.prob)], ': ', formatC(max(new.prob), 4, format = 'f'), '\n'))
## n01818515 macaw: 0.9956
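
– The softmax above exponentiates the raw scores directly, which works here but can overflow when the scores are large. A common safeguard, sketched below in base R with a made-up logits vector and a hypothetical helper stable_softmax, is to subtract the maximum score first; this leaves the result mathematically unchanged.

```r
# Softmax with the maximum subtracted first to avoid overflow in exp()
stable_softmax <- function(z) {
  z <- z - max(z)
  exp(z) / sum(exp(z))
}

# Deliberately large scores: exp(1000) overflows to Inf in double precision
logits <- c(1000, 1001, 1002)

naive <- exp(logits) / sum(exp(logits))   # Inf / Inf gives NaN
stable <- stable_softmax(logits)

print(naive)
print(stable)
```

Subtracting a constant from every score multiplies numerator and denominator by the same factor, so the probabilities are identical whenever the naive version does not overflow.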

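Transfer Learning (1)

– Of the two remaining problems, overfitting has many workable remedies, or we can collect more data to address it. The weight initialization problem, however, has remained unsolved.

– This idea is called transfer learning. It rests on the human ability to build on what has already been learned. For example, newly enrolled medical students have only high-school-level foundational training and no training in any medical specialty, yet because their learning builds on that high-school foundation, they can pick up even very difficult medical material fairly quickly.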

F8_39

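Transfer Learning (2)

– Let us run a small experiment: pass the parrot image through the resnet-18 from before and look at the feature maps of the first layer: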

my_model <- res_model

my_model$symbol <- relu0
my_model$arg.params <- my_model$arg.params[names(my_model$arg.params) %in% names(mx.symbol.infer.shape(relu0, data = c(224, 224, 3, 7))$arg.shapes)]
my_model$aux.params <- my_model$aux.params[names(my_model$aux.params) %in% names(mx.symbol.infer.shape(relu0, data = c(224, 224, 3, 7))$aux.shapes)]

features <- predict(my_model, X = normed, ctx = mx.cpu())

#Display image

eps = 1e-8
par(mar=rep(0,4), mfrow = c(3, 3))
for (i in 1:9) {
  plot(NA, xlim = c(0.04, 0.96), ylim = c(0.04, 0.96), xaxt = "n", yaxt = "n", bty = "n")
  feature_IMG <- t(features[,,i,])
  feature_IMG <- feature_IMG/(max(feature_IMG) + eps)
  rasterImage(feature_IMG, 0, 0, 1, 1, interpolate=FALSE)
}

par(mar=rep(0,4), mfrow = c(8, 8))
for (i in 1:64) {
  plot(NA, xlim = 0:1, ylim = 0:1, xaxt = "n", yaxt = "n", bty = "n")
  feature_IMG <- t(features[,,i,])
  feature_IMG <- feature_IMG/max(feature_IMG)
  rasterImage(as.cimg(as.array(my_model$arg.params$conv0_weight)[,,,i]), 0, 0, 1, 1, interpolate=FALSE)
}

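Transfer Learning (3)

– For example, we can keep the entire resnet-18 structure except the final fully connected layer, and change only that final FC layer from 1000 classes to 2 classes: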

#Get symbol

all_layers = res_sym$get.internals()
flatten0_output = which(all_layers$outputs == 'flatten0_output') %>% all_layers$get.output()

fc1 <- mx.symbol.FullyConnected(data = flatten0_output, num_hidden = 2, name = 'fc1')
softmax <- mx.symbol.softmax(data = fc1, axis = 1, name = 'softmax')

label = mx.symbol.Variable(name = 'label')

eps = 1e-8
m_log = 0 - mx.symbol.mean(mx.symbol.broadcast_mul(mx.symbol.log(softmax + eps), label))
m_logloss = mx.symbol.MakeLoss(m_log, name = 'm_logloss')
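
– To make the loss symbol above more concrete, the same quantity can be computed in base R on a tiny example. The probabilities and one-hot labels below are invented for illustration, and m_logloss here is an ordinary R function mirroring the symbolic definition, not the mxnet symbol.

```r
eps <- 1e-8

# Mirrors the symbolic definition: negative mean of log(prob + eps) * label
m_logloss <- function(prob, label) -mean(log(prob + eps) * label)

# Two samples (columns) and two classes (rows), matching the (2, batch) layout
prob <- matrix(c(0.9, 0.1,
                 0.2, 0.8), nrow = 2)
label <- matrix(c(1, 0,
                  0, 1), nrow = 2)

loss <- m_logloss(prob, label)
print(loss)
```

Note that the mean is taken over every element of the 2 × batch matrix, not over samples, so with two classes the value is half of the usual per-sample cross-entropy. This only rescales the gradient and does not change which weights minimize the loss.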

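– Next, recall that all parameters must be initialized before training begins. We can fill everything except the last layer with the resnet-18 parameters and then continue training on the new task: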

mx.set.seed(0)
new_arg = mxnet:::mx.model.init.params(symbol = m_logloss,
                                       input.shape = list(data = c(224, 224, 3, 7), label = c(2, 7)),
                                       output.shape = NULL,
                                       initializer = mxnet:::mx.init.uniform(0.01),
                                       ctx = mx.cpu())

for (i in 1:length(new_arg$arg.params)) {
  pos <- which(names(res_model$arg.params) == names(new_arg$arg.params)[i])
  if (isTRUE(all.equal(dim(res_model$arg.params[[pos]]), dim(new_arg$arg.params[[i]])))) {
    new_arg$arg.params[[i]] <- res_model$arg.params[[pos]]
  }
}

for (i in 1:length(new_arg$aux.params)) {
  pos <- which(names(res_model$aux.params) == names(new_arg$aux.params)[i])
  if (isTRUE(all.equal(dim(res_model$aux.params[[pos]]), dim(new_arg$aux.params[[i]])))) {
    new_arg$aux.params[[i]] <- res_model$aux.params[[pos]]
  }
}
batch_size = 20

#1. Build an executor to train model
my_executor = mx.simple.bind(symbol = m_logloss,
                             data = c(224, 224, 3, batch_size), label = c(2, batch_size),
                             ctx = mx.cpu(), grad.req = "write")

#2. Set the initial parameters
mx.exec.update.arg.arrays(my_executor, new_arg$arg.params, match.name = TRUE)
mx.exec.update.aux.arrays(my_executor, new_arg$aux.params, match.name = TRUE)
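
– The two loops above copy a pretrained parameter only when its name and shape both match the newly initialized one. The same logic can be tried out without mxnet using plain R lists; in the sketch below the lists pretrained and fresh and the helper copy_matching are hypothetical stand-ins for res_model$arg.params and new_arg$arg.params.

```r
# Hypothetical stand-ins: a pretrained 1000-class model and a fresh 2-class model
pretrained <- list(conv0_weight = array(1, dim = c(7, 7, 3, 64)),
                   fc1_weight   = array(1, dim = c(512, 1000)),
                   fc1_bias     = array(1, dim = 1000))
fresh <- list(conv0_weight = array(0, dim = c(7, 7, 3, 64)),
              fc1_weight   = array(0, dim = c(512, 2)),
              fc1_bias     = array(0, dim = 2))

# Overwrite a fresh parameter only if a same-named, same-shaped pretrained one exists
copy_matching <- function(fresh, pretrained) {
  for (nm in names(fresh)) {
    same_shape <- identical(dim(pretrained[[nm]]), dim(fresh[[nm]]))
    if (!is.null(pretrained[[nm]]) && same_shape) {
      fresh[[nm]] <- pretrained[[nm]]
    }
  }
  fresh
}

result <- copy_matching(fresh, pretrained)
print(sapply(result, function(p) p[1]))
```

Here conv0_weight is taken from the pretrained list, while fc1_weight and fc1_bias keep their fresh values because their shapes differ after switching from 1000 classes to 2.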

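Exercise 2: Cat vs. Dog Classification with Transfer Learning

– Download 100 cat images and 100 dog images from here, train a classifier on them, and then use it to predict 5 test images each of cats and dogs.

– This task will challenge your programming skills somewhat: the images have not been preprocessed at all and were taken from arbitrary angles. Your goal is to train a reasonably accurate model using only 100 images each of cats and dogs.

– If you have time, compare the accuracy of the following three settings:

  1. Predict directly with the original resnet-18 (ImageNet already contains cat and dog images; count a prediction as correct if it outputs any cat or dog label)

  2. Initialize the weights by transfer learning, train with the resnet-18 architecture, and then predict

  3. Initialize the weights randomly, train with the resnet-18 architecture, and then predict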

Exercise 2 Answer (1)

library(imager)
library(magrittr)

# Define image processing functions

preproc.image <- function(im) {
  resized <- resize(im, 224, 224)
  resized <- as.array(resized) * 255
  # Reshape to format needed by mxnet (width, height, channel, num)
  dim(resized) <- c(224, 224, 3, 1)
  return(resized)
}

# Read data

Train_img.array <- array(0, dim = c(224, 224, 3, 200))
Train_Y.array <- array(t(model.matrix(~ -1 + factor(rep(1:2, 100)))), dim = c(2, 200))

for (i in 1:100) {
  cat_img <- load.image(paste0('Dogs vs. Cats/cat.', i, '.jpg'))
  Train_img.array[,,,(i-1)*2 + 1] <- preproc.image(cat_img)
  dog_img <- load.image(paste0('Dogs vs. Cats/dog.', i, '.jpg'))
  Train_img.array[,,,i*2] <- preproc.image(dog_img)
}
library(mxnet)

# Iterator

my_iterator_core = function(batch_size) {
  
  batch = 0
  batch_per_epoch = ncol(Train_Y.array)/batch_size
  
  reset = function() {batch <<- 0}
  
  iter.next = function() {
    batch <<- batch+1
    if (batch > batch_per_epoch) {return(FALSE)} else {return(TRUE)}
  }
  
  value = function() {
    idx = 1:batch_size + (batch - 1) * batch_size
    idx[idx > ncol(Train_Y.array)] = sample(1:ncol(Train_Y.array), sum(idx > ncol(Train_Y.array)))
    data = mx.nd.array(Train_img.array[,,,idx])
    label = mx.nd.array(Train_Y.array[,idx])
    return(list(data = data, label = label))
  }
  
  return(list(reset = reset, iter.next = iter.next, value = value, batch_size = batch_size, batch = batch))
}

my_iterator_func <- setRefClass("Custom_Iter",
                                fields = c("iter", "batch_size"),
                                contains = "Rcpp_MXArrayDataIter",
                                methods = list(
                                  initialize = function(iter, batch_size = 100){
                                    .self$iter <- my_iterator_core(batch_size = batch_size)
                                    .self
                                  },
                                  value = function(){
                                    .self$iter$value()
                                  },
                                  iter.next = function(){
                                    .self$iter$iter.next()
                                  },
                                  reset = function(){
                                    .self$iter$reset()
                                  },
                                  finalize=function(){
                                  }
                                )
)

my_iter = my_iterator_func(iter = NULL, batch_size = 10)

# Optimizer

my_optimizer = mx.opt.create(name = "sgd", learning.rate = 0.05, momentum = 0.9, wd = 0)

Exercise 2 Answer (2)

# Read pre-trained model

res_model = mx.model.load("model/resnet-18", 0)
res_sym = mx.symbol.load("model/resnet-18-symbol.json")

# Get symbol

all_layers = res_sym$get.internals()
flatten0_output = which(all_layers$outputs == 'flatten0_output') %>% all_layers$get.output()

# Define Model Architecture

fc1 <- mx.symbol.FullyConnected(data = flatten0_output, num_hidden = 2, name = 'fc1')
softmax <- mx.symbol.softmax(data = fc1, axis = 1, name = 'softmax')

label = mx.symbol.Variable(name = 'label')

eps = 1e-8
m_log = 0 - mx.symbol.mean(mx.symbol.broadcast_mul(mx.symbol.log(softmax + eps), label))
m_logloss = mx.symbol.MakeLoss(m_log, name = 'm_logloss')
mx.set.seed(0)
new_arg = mxnet:::mx.model.init.params(symbol = m_logloss,
                                       input.shape = list(data = c(224, 224, 3, 7), label = c(2, 7)),
                                       output.shape = NULL,
                                       initializer = mxnet:::mx.init.uniform(0.01),
                                       ctx = mx.cpu())

for (i in 1:length(new_arg$arg.params)) {
  pos <- which(names(res_model$arg.params) == names(new_arg$arg.params)[i])
  if (isTRUE(all.equal(dim(res_model$arg.params[[pos]]), dim(new_arg$arg.params[[i]])))) {
    new_arg$arg.params[[i]] <- res_model$arg.params[[pos]]
  }
}

for (i in 1:length(new_arg$aux.params)) {
  pos <- which(names(res_model$aux.params) == names(new_arg$aux.params)[i])
  if (isTRUE(all.equal(dim(res_model$aux.params[[pos]]), dim(new_arg$aux.params[[i]])))) {
    new_arg$aux.params[[i]] <- res_model$aux.params[[pos]]
  }
}
#1. Build an executor to train model

my_executor = mx.simple.bind(symbol = m_logloss,
                             data = c(224, 224, 3, 10), label = c(2, 10),
                             ctx = mx.cpu(), grad.req = "write")

#2. Set the initial parameters

mx.exec.update.arg.arrays(my_executor, new_arg$arg.params, match.name = TRUE)
mx.exec.update.aux.arrays(my_executor, new_arg$aux.params, match.name = TRUE)

#3. Define the updater

my_updater = mx.opt.get.updater(optimizer = my_optimizer, weights = my_executor$ref.arg.arrays)
for (i in 1:3) {
  
  my_iter$reset()
  batch_loss = NULL
  
  while (my_iter$iter.next()) {
    
    my_values <- my_iter$value()
    mx.exec.update.arg.arrays(my_executor, arg.arrays = my_values, match.name = TRUE)
    mx.exec.forward(my_executor, is.train = TRUE)
    mx.exec.backward(my_executor)
    update_args = my_updater(weight = my_executor$ref.arg.arrays, grad = my_executor$ref.grad.arrays)
    mx.exec.update.arg.arrays(my_executor, update_args, skip.null = TRUE)
    batch_loss = c(batch_loss, as.array(my_executor$ref.outputs$m_logloss_output))
    
  }
  
  message(paste0("epoch = ", i, ": m-logloss = ", formatC(mean(batch_loss), format = "f", 4)))
  
}

Exercise 2 Answer (3)

# Get model

dog_cat_model <- mxnet:::mx.model.extract.model(symbol = softmax,
                                                train.execs = list(my_executor))

# Predict & Display

par(mar=rep(0,4), mfcol = c(2, 5))

for (i in 1:5) {
  
  plot(NA, xlim = c(0.04, 0.96), ylim = c(0.04, 0.96), xaxt = "n", yaxt = "n", bty = "n")
  cat_img <- load.image(paste0('Dogs vs. Cats/test_cat.', i, '.jpg'))
  norm_cat_img <- preproc.image(cat_img)
  rasterImage(cat_img, 0, 0, 1, 1, interpolate=FALSE)
  prob <- predict(dog_cat_model, X = norm_cat_img, ctx = mx.cpu())
  text(0.5, 0.95, formatC(prob[1,1], 3, format = 'f'), col = "green", cex = 2)
  
  plot(NA, xlim = c(0.04, 0.96), ylim = c(0.04, 0.96), xaxt = "n", yaxt = "n", bty = "n")
  dog_img <- load.image(paste0('Dogs vs. Cats/test_dog.', i, '.jpg'))
  norm_dog_img <- preproc.image(dog_img)
  rasterImage(dog_img, 0, 0, 1, 1, interpolate=FALSE)
  prob <- predict(dog_cat_model, X = norm_dog_img, ctx = mx.cpu())
  text(0.5, 0.95, formatC(prob[1,1], 3, format = 'f'), col = "green", cex = 2)
  
}

Exercise 2 Answer (4)

Fixed_NAMES = names(res_model$arg.params)[names(res_model$arg.params) %in% names(mx.symbol.infer.shape(flatten0_output, data = c(224, 224, 3, 10))$arg.shapes)]

#1. Build an executor to train model

my_executor = mx.simple.bind(symbol = m_logloss, fixed.param = Fixed_NAMES,
                             data = c(224, 224, 3, 10), label = c(2, 10),
                             ctx = mx.cpu(), grad.req = "write")

#2. Set the initial parameters

mx.exec.update.arg.arrays(my_executor, new_arg$arg.params, match.name = TRUE)
mx.exec.update.aux.arrays(my_executor, new_arg$aux.params, match.name = TRUE)

#3. Define the updater

my_updater = mx.opt.get.updater(optimizer = my_optimizer, weights = my_executor$ref.arg.arrays)
for (i in 1:3) {
  
  my_iter$reset()
  batch_loss = NULL
  
  while (my_iter$iter.next()) {
    
    my_values <- my_iter$value()
    mx.exec.update.arg.arrays(my_executor, arg.arrays = my_values, match.name = TRUE)
    mx.exec.forward(my_executor, is.train = TRUE)
    mx.exec.backward(my_executor)
    update_args = my_updater(weight = my_executor$ref.arg.arrays, grad = my_executor$ref.grad.arrays)
    mx.exec.update.arg.arrays(my_executor, update_args, skip.null = TRUE)
    batch_loss = c(batch_loss, as.array(my_executor$ref.outputs$m_logloss_output))
    
  }
  
  message(paste0("epoch = ", i, ": m-logloss = ", formatC(mean(batch_loss), format = "f", 4)))
  
}

– When parameters are fixed, they must be put back into the model at the Get model step:

# Get model

dog_cat_model <- mxnet:::mx.model.extract.model(symbol = softmax,
                                                train.execs = list(my_executor))

dog_cat_model$arg.params <- append(dog_cat_model$arg.params, res_model$arg.params[names(res_model$arg.params) %in% Fixed_NAMES])

# Predict & Display

par(mar=rep(0,4), mfcol = c(2, 5))

for (i in 1:5) {
  
  plot(NA, xlim = c(0.04, 0.96), ylim = c(0.04, 0.96), xaxt = "n", yaxt = "n", bty = "n")
  cat_img <- load.image(paste0('Dogs vs. Cats/test_cat.', i, '.jpg'))
  norm_cat_img <- preproc.image(cat_img)
  rasterImage(cat_img, 0, 0, 1, 1, interpolate=FALSE)
  prob <- predict(dog_cat_model, X = norm_cat_img, ctx = mx.cpu())
  text(0.5, 0.95, formatC(prob[1,1], 3, format = 'f'), col = "green", cex = 2)
  
  plot(NA, xlim = c(0.04, 0.96), ylim = c(0.04, 0.96), xaxt = "n", yaxt = "n", bty = "n")
  dog_img <- load.image(paste0('Dogs vs. Cats/test_dog.', i, '.jpg'))
  norm_dog_img <- preproc.image(dog_img)
  rasterImage(dog_img, 0, 0, 1, 1, interpolate=FALSE)
  prob <- predict(dog_cat_model, X = norm_dog_img, ctx = mx.cpu())
  text(0.5, 0.95, formatC(prob[1,1], 3, format = 'f'), col = "green", cex = 2)
  
}

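Conclusion

– We have now broadly learned how to deal with the three classic theoretical problems of deep learning (overfitting, vanishing gradients, and weight initialization), and we have the basic ability to write a program that trains an AI model for image classification.

– Here is a homework assignment: download the full training data from Dogs vs. Cats on Kaggle, predict the test set, and submit your answers to Kaggle to see your ranking.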

F8_40

– This is the score obtained after training for 3 epochs on the 100 samples used earlier:

F8_41

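– In this assignment, every technique learned in earlier lessons has a chance of improving accuracy. You can start by downloading your favorite classic model, apply transfer learning on top of it, use every overfitting countermeasure we have learned, and then see how accurate the result is.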